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Abstract 

Observations depending on sums of random variables are common through¬ 
out many fields; however, no efficient solution is currently known for per¬ 
forming max-product inference on these sums of general discrete distri¬ 
butions (max-product inference can be used to obtain maximum a pos¬ 
teriori estimates). The limiting step to max-product inference is the 
max-convolution problem (sometimes presented in log-transformed form 
and denoted as “infimal convolution”, “min-convolution”, or “convolution 
on the tropical semiring”), for which no 0(fclog(fc)) method is currently 
known. Here I present a 0(fclog(fc)) numerical method for estimating the 
max-convolution of two nonnegative vectors {e.g., two probability mass 
functions), where k is the length of the larger vector. This numerical max- 
convolution method is then demonstrated by performing fast max-product 
inference on a convolution tree, a data structure for performing fast in¬ 
ference given information on the sum of n discrete random variables in 
0(nfc log(nfc) log(n)) steps (where each random variable has an arbitrary 
prior distribution on k contiguous possible states). The numerical max- 
convolution method can be applied to specialized classes of hidden Markov 
models to reduce the runtime of computing the Viterbi path from nk^ to 
nfclog(fc), and has potential application to the all-pairs shortest paths 
problem. 


1 Introduction 

In many fields it is common to have access to information about sums of 
random variables and to desire information about those variables themselves. 
In mass spectrometry, when two (or more) analytes with similar mass-to-charge 
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are measured, the intensity of the resulting peak is a function of the sum of abun¬ 
dances of those analytes (this problem occurs not only in the mass spectrometry 
of small molecules, but also in measuring isotope measurement in elemental and 
nuclear mass spectrometry). In transcriptomics, the abundance of a particu¬ 
lar non-unique read (i.e., an RNA sequence that maps to multiple locations 
in the transcriptome or genome) provides information about the sum of the 
abundances of all transcripts that contain the read (each transcript weighted 
by how many copies of the read it carries). Proteomics has its own version 
of non-unique reads, shared peptides which can be found in multiple proteins 
(not only are shared peptides the principal source of difficulty in protein infer¬ 
ence [T4|[l7|[I^ , they are also responsible for the difficulty evaluating putatative 
sets of discovered proteins [I^[T^ ). In population genetics, the prior knowledge 
about population structure can suggest an expected number of individuals with 
a particular genotype, which in turn yields probabilistic information about the 
individuals whose aggregate genotypes are expected to produce that sum (infer¬ 
ence is particularly pronounced in polyploids, which increase the dimensionality 
of the problem 
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In all of these fields, the information on sums of random variables presents a 
singular obstacle to computational biology. And regardless of how infrequently 
we as scientists directly discuss our current inability to effectively utilize infor¬ 
mation about sums of random variables, the perception of our limited ability to 
meet the challenge has become firmly entrenched in our collective unconscious; 
the silent agreement on our inability to turn the sausage grinder backwards and 
convert the sausage (information about the sum of several random variables) 
back into the pigs (information about those random variables that contributed 
to the sum) is so well-established that it not only defines the way that we address 
these data (be., mass spectra peaks containing multiple analytes, counts of non¬ 
unique reads, etc.), but it also causes us to discard data and even limit research 
directions we might otherwise consider. For instance, in mass spectrometry, 
substantial effort is invested in chromatography [2jl^ and other separation tech¬ 


niques 12 , which aim to distinguish and separate analytes so that they will not 
be measured by the mass spectrometer simultaneously (thereby reducing the 
chances of analytes resulting in overlapping peaks); the task of decomposing 
this useful aggregate information from overlapping peaks back into informa¬ 
tion about its contributing parts is eschewed in favor of significant investments 
in instrumentation {e.g., more and more advanced separation technologies and 
higher-resolution mass spectrometers [11| ), and still it is common practice to dis¬ 
card shared isotope peaks and shared peptides even though they may comprise 
a large percent of the data and contain additional information |5, 19 . Likewise, 
in genomics and transcriptomics it is common to simply discard all data from 
non-unique reads 10 23 . In burgeoning fields such as metagenomics, this loss 


of data- and subsequent loss of information- can be even more pronounced, 
for instance when all data that map to two or more species of interest are dis¬ 
carded. In some cases, recovering this lost information will be the key to making 
strong conclusions, such as distinguishing between two closely related species 
(or bacterial strains) in a metagenomic mixture. 
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Sum-product inference 

Fast Fourier transform- (FFT-)based convolution can be used to dramati¬ 
cally improve the efficiency of computing the sum-product addition of two dis¬ 
crete random variables. For three discrete random variables where M = L + R 
and where L G {0, 1,.. .k — 1} and R G {0,1,... fc — 1}, then the probability 
mass function (PMF) of M can be computed via the convolution of the PMFs 
of L and R, denoted pmf^ and pmf^ respectively. Note that it is sufficient to 
compute pmfjy^ oc pmf^, because the result will be scaled so that its sum is of 
unity (pmfjv^ = ^ ): 

pmfjvf[m] oc EE Pr(L = £) Pr(i? = r) Pr(M = £ + r) 

I r 

= Pr(L = 1) Pt{R = m — t) 
i 

= pmf^^ * pmfjj 


where * performs the convolution between the two A:—length vectors storing 
each PMF. 

Whereas naive convolution would compute pmfj^^ in 0{k x k) steps, FFT 
exploits the bijection of this convolution to the product between two polynomials 
(where the vectors being convolved are the coefficients of the polynomials being 
multiplied and the coefficients of their product forms the vector result); this 
bijection enables the use of alternative forms for representing the polynomials 
(each order k—1 polynomial can be represented through k unique points through 
which it passes), which in turn permits elegant divide and conquer algorithms 
such as the Cooley-Tukey FFT to compute fast convolution in fclog(fc) steps. 
Subtraction in the sum-product (*.e., computing L = M — R) scheme can be 
performed by first negating R' = —R (this is done by reversing reversing the 
vector storing the PMF pmf^, [r] = pmf^[—r]), and then adding L = M + R' as 
before via FFT convolution pmf)^ = pmf^ * pmf^,. Also, the runtime constant 
on FFT convolution algorithms is generally very low, partly due to the nature 
of elegance of the algorithms and partly because implementations have been 
optimized heavily due to the ubiquity of convolution in signal processing. 

The task of processing information about the sum of n discrete variables 
(each with k bins) to retrieve information on the individual variables can be 
performed naively in 0{k^) steps by simply enumerating the exponentially many 
possible outcomes; however, such brute-force techniques are wildly inefficient 
when either n ot k become large. Fortunately, recent work proposes methods to 
decompose larger problems {e.g., into multiple sums and differences of pairs of 
discrete variables of the form M = L + R and L = M — R, very fast inference can 
be achieved. This has been derived for binary variables (fc = 2) in nlog(n) log(n) 
21 , and was independently discovered for arbitrary discrete distributions {i.e., 
where k > 2) and to multidimensional distributions (via matrix convolution, 
which can be decomposed into one-dimensional convolutions by the row-column 
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algorithm) using the probabilistic convolution tree (algorithm Q. In the 
general case, distributions on all individual variables conditional on information 
about the sum can be computed in 0{nk log(nfc) log(n)) steps (whenever k log(/c) 
fast convolution is available): 


convolutionTreeCost(n,k) = 


G 


login) 

— convolutionCost{k2 

U — 1 

login) 

^ fc2“log(fc2“) 

U—1 

login) 

nk E log(fc) + log(2“) 

U — 1 
login) 

nk ^ log(fc) + u 

u—1 
login) 

nk IE log(fc) 


login) 

E- 

U = 1 


nk [log(n) log(fc)] - 
0{nk \og{nk) log(n). 


log(n)(log(n) + 1) 


In practice, this can be significantly faster than the 0{n^k'^) steps required by 
dynamic programming when fast convolution is not available. For instance, 
when an observed transcript fragment could originate from n = 256 species, 
and where the abundance of each species is discretized into k = 1024 bins, then 
fast convolution makes inference more than 1800 times faster (the difference 
between one algorithm taking 1 second and the other taking 30 minutes), and 
the disparity only grows for problems with larger values of n and k. It should 
be noted that these are only approximate runtimes calculated from the Big O 
form; in practice, it is fairly likely that methods based on fast convolution will 
be significantly faster, because of the method’s inherent properties and the fact 
that very optimized implementations exist. 

Max-product inference 

Qualitatively, max-product inference is a close cousin to sum-product infer¬ 
ence. Where sum-product inference considers each of the exponentially many 
joint events and allows each to contribute to the result (in hidden Markov mod¬ 
els, this is analogous to the forward-backward algorithm), max-product infer¬ 
ence allows only the highest-quality joint events to contribute (in hidden Markov 
models, this defines the Viterbi path). Both inference methods have complemen¬ 
tary advantages and disadvantages: The advantage of sum-product inference 


4 









Algorithm 1 The probabilistic convolution tree algorithm utilizes fast 
convolution (or fast max-convolution) to efficiently turn information on sums of 
variables back into information on the individual variables. The first parameter 
(pmf , pmf. pmf) is a collection of n multidimensional discrete distri¬ 
butions (with same dimension). In the case of univariate distributions, they are 
one-dimensional PMFs with k possible outcomes. The second parameter pmf^ 
is a multidimensional discrete distribution (with same dimension as Xi, X 2 , ■ ■.) 
where M = Xi -|- X 2 -I-... -I- A„. The third parameter is a convolution operator 
{e.g., either standard convolution or max-convolution). The algorithm returns 
a pair of values. The first is a collection of likelihood distributions given on 
the information from the sum M, (pmfy^,pmfy-^,.. .pmfy^). The second is the 
prior distribution of M after adding together all Xi + X 2 + ■ ■ ■ A„. When n 
is not an integer power of 2, dummy variables whose PMFs of length 1 (ie., 
PMFs that are Kronecker deltas with 100% chance of having value 0) should be 
padded on until the next power of 2 is reached (this will allow construction of 
a full binary tree without changing the sum). 

1: procedure PROBABILISTICCONVOLUTIONTREE((pmfjf^ , ,... pmf^^), 

pmfM 1 *param) 

2: forwardTree ^ [ [pmf, pmf^^,... pmf^^] ] 

3: for i = 1 to len{log 2 (n)) i+ = 1 do 

4: currentLayer forwardTree[—l] > Where [—1] gives the last index as 

in Python 

5: newLayer -h- [ ] 

6: for j = 1 to len{currentLayer) — 1, j+ — 2 do 

7: sum -h- currentLayer[j] *param currentLayer[j + 1] 

8: newLayer.append{normalized{sum))) t> Add the PMFs using the 

^param Operator 

9: end for 

10: forwardTree.append{new Layer) 

11: end for 

12: reverseTree [ [pmfjy^] ] 

13: for i = 1 to len{log 2 (n)), i+ = 1 do 

14: currentLayer reverseTree)—1] 

15: newLayer <r- [ ] 

16: forwardTreeLayerToSubtract forwardTree[—i — 2] > Python 

notation for the second to last layer in forwardTree 
17: for j = 1 to len{currentLayer), j-\- = 1 do 

18: Ihs <r- forwardTreeLayerToSubtract[2j] 

19: rhs forwardTreeLayerToSubtract[2j -|- 1] 

20: rhsSubtracted <r- currentLayer[j] *param {—rhs) > Negation reverses 

the vector 

21: t> Then the *param operator performs addition 

22: newLhs normalized{rhsSubtracted) 

23: IhsSutracted ^ currentLayer[j] *param {—Ihs) 

24: newRhs <r- normalized{lhsSubtracted) 

25: newLhs newLhs.narrowToTheSupportOf {Ihs) 

26: newRhs <r- newRhs.narrowToTheSupportOf {rhs) 

27: newLayer.append{newLhs) 

28: newLayer.append{newRhs) 

29: end for 

30: reverseTree.append{newLayer^ 

31: end for 

32: Vj, pmfy-^. reverseTree)—l][ji] 

33: pmf^ forwardTree[—l][0] 

34: return ( (pmfy^, pmfy^,... pmfy^), pmf^ ) > First return value: collection of 

likelihood distributions 
35: 

36: end procedure 


> Second return value: prior distribution on M 








is its democratized equal weighting of all joint events, the variety of which can 
provide a rich description of any high-probability joint events suggested by the 
data; however, this can also have disadvantages in that many low-quality joint 
events (i.e., those with low joint probability) may shape the result as much as a 
small number of high-quality results. Likewise, in sum-product inference, mul¬ 
tiple mutually exclusive joint events can simultaneously contribute to the result, 
raising the potential to erroneously infer implausible conclusions, because both 
may be plausible before considering the other. It is because of these disad¬ 
vantages in sum-product inference that max-product inference is widely used, 
because it forces the inferences to be jointly plausible (not simply individually, 
but as a whole), and because it drowns out noise from low-quality configura¬ 
tions that can diffuse and lower the certainty of conclusions in sum-product 
inference. 

Specifically, efficient max-product inference on sums of random variables 
would be quite useful; in addition to the examples of shared peptides, non¬ 
unique reads, etc. found throughout computational biology (wherein we have 
information about the sum of variables, but want to draw conclusions about the 
variables themselves), more efficient max-product inference would make pos¬ 
sible new inference algorithms on specialized classes of hidden Markov models 
(HMMs) where the transition probabilities from the state at index a to the 
state at index b depend on a function of either a + b oi a — b oi b — a. Such 
HMMs have applications in finance and time series analysis, where the k states 
at any layer are high-resolution discretizations of some quantity or price, and 
where the probability of a price moving up or down is influenced by the quan¬ 
tity up or down it moved since the previous time point. In a general HMM 
with k states and n layers of those states, performing either sum-product (via 
the forward-backward algorithm) or max-product (via the Viterbi algorithm) 
inference requires 0{nk^) steps; however, performing sum-product inference on 
the specialized class of HMMs mentioned above would require only 0{nk log(fc)) 
steps, because each layer can be processed as a two-node convolution tree. But 
finding the Viterbi path on such a model in 0(nklog{k)) steps is not currently 
feasible, because doing so would require performing max-convolution (where the 
max of all valid pairings is chosen rather than the sum over all valid pairings) 
in 0{k\og{k)) steps. 

However, despite the promise of max-product inference on sums of ran¬ 
dom variables, a fast practical solution that utilizes 0{k log(fc)) max-convolution 
(i.e., one with speed roughly comparable to FFT-based standard convolution) 
is not yet available for the general max-convolution problem problem. One spe¬ 
cial cases for use only when k = 2 [8 20 , can solve the problem in nlog(n) 
time by sorting the n variables in descending order of probability Pr(Xi = 1) > 
Pr(V 2 = 1) > • • • > Pr(X„ = 1) and then exploiting the property that any 
case where the number of “true” variables Xj = m must prefer the first m 
variables in the sorted order. This method understandably fails when k > 2 be¬ 
cause there is no guarantee of an ordering that will satisfy all dimensions (when 
fc = 2, increasing the probability of Pr{Xj = 1) has a useful effect of decreasing 
the probability of Pr(Vj = 0) by the same amount in order to preserve the 
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unitary value of the sum). When k > 2, a similar idea to the sorting approach 
can be used, but not without approximation or some method for exploring or 


optimizing the exponential space of joint events 15 . 

Adapting the probabilistic convolution tree algorithm from algorithm[^(by 
simply replacing all uses of * with *max when adding pairs of variables) achieves 
only an O(n^fc^) runtime for max-product inference because additions and sub¬ 
tractions between individual pairs of random variables will require 0{ki x ^ 2 ) 
time without a faster algorithm for max-convolution. It is tempting to try to 
derive an FFT equivalent to max-convolution, a subtle difference makes this 
challenging: Where standard convolution uses the operations (-I-, x) on real¬ 
valued numbers (a “ring”), max-convolution employs (max, x) (alternatively 
applying a log-transformation to the probabilities being convolved will negate 
them and thus change the problem to the equivalent (min, -I-) operations, called 
min-convolution, infimal convolution, or convolution on the “tropical semiring”). 
Regardless of which form is used, the employment of the max (or min in the 
min-convolution case) downgrades the operation from a ring to a “semiring” 
because the max and min operations have no inverse. Thus, the max-product 
addition of two discrete random variables takes a different form, which is no 
longer bijective to polynomial multiplication: 

pmfjy^lm] oc maxmaxPr(L = £) Pr(i? = r) Pr(M = £-I-r) 

£ r 

= maxPr(L = £) Pr(i? = m — tj 

= pmf^ *i„axpmfj^ 


where *max is the max-convolution operator. The loss of the bijective polynomial 
representation prevents the exploitation of the Lagrange form of polynomials, 
and thus there is no known k log(fc) algorithm for performing max-convolution. 

Excluding such highly specialized methods as the rank method of Babai 
(which achieves a runtime of 0 {kl^Jh 2 ^og{k^) time but only when the vector of 
length ^2 contains elements with value 0 or 00 ), the two most sophisticated max- 
convolution algorithms applicable to probabilistic inference are from Bussieck 
et al. and Bremner et al. [^. 

The method from Bussieck has a 0{k‘^) runtime in the worst case, but under 
a certain distribution values in the two vectors being convolved, the authors 
demonstrate an expected runtime of 0(fclog(fc)) |^. The approach works by 

starting the result Vm, pmf'^[TO] <- 00 and then proceeds by sorting the two 

vectors being convolved {L and R) in descending order. Their method then 
proceeds through both lists head-first to generate the first /clog(fc) sorted terms 
of V£, r pmfj;^[£] pmf^[r]. Each of these terms is used to update the appropriate 
index in the result vector pmf)y^[£ -I- r] •<— max(pmf[£ -|- r],pmf^[£] pmfjj[r]). 
Thus far, the algorithm is G 0(klog(k)), but there may be indices of pmf'^ 
that have not yet been set (they are still equal to — 00 ); each of these must 
be computed, and each such direct computation takes 0(k) time (if there are 
fl(k) such unset indices, then the overall runtime becomes 0(k^)). Despite the 
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significant achievement posed by the construction of this algorithm, the authors 
suggest that the runtime constant is quite high due to the overhead of the 
sophisticated algorithms used to sort the largest k\og{k) values while neither 
sorting nor even generating all values; they suggest that their result is of 
mostly theoretical import, and suggest using other methods in practice. 

The method of Bremner et al. (which was subsequently extended by Williams 
[22| ) draws a relationship between min-convolution and the necklace alignment 
problem, wherein two two collections of beads, each on its own circular string, 
are rotated to optimally align [^. Their method is the most sophisticated in ex¬ 
istence, and consists of a highly complicated exploitation of similarity to the all¬ 
pairs shortest paths problem to achieve a method with a subquadratic worst-case 
runtime of each max-convolution (the runtime of the Brem¬ 

ner et al. method can also be improved using a more recent method for so lvin g 


the all-pairs shortest paths problem, decreasing the runtime to 


22 


)• 


2f2( v^log(fe)) 

Even if it were possble to be implemented with runtime constant as optimized as 
EFT, the cost of using a max-convolution tree to solve the previously mentioned 
metagenomic max-product inference problem on n = 256 variables where each 
has k = 1024 states would be over 166 times slower than the cost of solving 
an equally sized sum-product problem with EFT convolution (the number of 
steps required was calculated numerically to avoid computing a closed form of 
the computational cost). 

Thus, even with significant mathematical sophistication of these two state- 
of-the-art methods, practically efficient max-product inference may be out of 
reach for even moderately sized problems. 


2 A numerical method for efficiently estimating 
max-convolution 

Here 1 will introduce a numerical method for estimating the max-convolution 
in k log(fc) time, which can be applied easily using existing high-performance nu¬ 
merical software libraries. This is essentially performed by transforming both 
the inputs and outputs of the EFT to achieve p—norm convolution, which in 
turn is used as an approximation for max-convolution via the Chebyshev norm. 

For a max-convolution between pmf^ and pmf^ 

pmf)^^[TO] = max pmfj;^[£] pmf^[m — £], 

at each m value, the shifted products terms can be rewritten as a simple vector 
yl™) where elements are defined by 

pnif^[TO — £]. 

Furthermore, because PMFs consist of nonnegative real values (or machine- 
precision representations), then this can be rewritten using the Chebyshev norm, 







which computes the maximum absolute value in the vector . Because 
comes from the product of PMF terms, it is also nonnegative and thus absolute 
values can be ignored in both the computation and the result: 


pmf)v^[m] 

= maxu 
£ 

(»")[£] 

pmf'^HI 

= lim II 


p—>-oo 

pmf'^HI 

= lim 

p—^oo \ 

^|uW[€]| 
t e 

PhiImH 

= lim 

p— >-oo \ 



And then each element of [P\ can be expanded back into its original factors: 
pmf'^jfm] = lim ( pmf^[£]^pmf B[m — ] . 

p^OO j 

At this point, a sufficiently large value p* is used in place of the limit p —>■ oo: 

pmf'^H « ^^pmfi[£]^ pmf^[m-£]^ j . 

At this point, it can be observed that every time elements of the PMFs 
pmf^[t'] and pmf^[m — P\ appear, they are raised to the p* power; thus, it is 
possible to change variables and let 

VL[i] = pmf^[£]^ 

Vr VR[r] = pmf^[r]^ , 


yielding 


pmf'^H « [ ^ VL[i]vR[m - 


A similar strategy can be made for pmf^^; it is possible to introduce another 
vector vm such that every element pmi'j^[m] is the result of 


pmf'^ 


M 


m 


UM mb’ 


VM[m] = '^VL [i] VR[m-£] 


And thus it becomes clear that vm is the result of standard convolution [not a 
max-convolution) between vr and vr. This suggests a numerical algorithm that 
can make use of existing FFT convolution libraries to compute vm = vl*vr in 
0{klog{k)) steps (algorithm]^. 
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Algorithm 2 Numerical max-convolution (initial version), a numerical 
method to estimate the max-convolution of two PMFs or nonnegative vectors. 
The parameters are two PMFs pmf^^ and pmf^ (by definition nonnegative) and 
the numerical value p* used for computation. The return value is a numerical 
estimate of the max-convolution pmf^^ *maxPinf_R- 

1: procedure MAxCONVOLUTlON(pmf^, pmf^, p*) 

2 : yi, vrai^[£Y 

3: Vr, VR[r] -s- pmf^[r]^ 

4: rjvr pmf^ * pmf^ [> Standard FFT convolution is used here 

5: Vm, pmiYilrn] <—VMlrn]^ 

6: return pmf)^^ > The return value is an estimate of the max-convolution result 

7: end procedure 


2.1 Reducing underflow 

The main caveat for numerical methods is often the loss of precision due 
to underflow when raising small probabilities to the power p* (in this case, no 
overflow occurs because the inputs are probabilities); such losses may not be 
undone later when raising to the ^ power. One way to limit unnecessary loss 
of precision is to recognize that since pmf)^^ will be normalized after it is es¬ 
timated, then it can be scaled arbitrarily during computation without altering 
the final result. For this reason it can be beneficial to scale a vector by divid¬ 
ing by its maximum element before raising it to the p* or ^ power; this will 
start the dominant elements close to 1, and thus allow them to lose little infor¬ 
mation to underflow. Using this strategy yields a slightly modified algorithm 
(algorithm [^. 

Note that, like standard implementations of fast convolution *, in imple¬ 
menting the fast *max operator it is possible to automatically choose between 
a naive implementation or the fast numerical implementation depending on the 
size of the problem; on very small problems {e.g. k = 8), the naive operation 
will have less overhead and can be a bit faster (the specific threshold can be 
chosen roughly by comparing the expected running time from the fast numerical 
method 0{k' log(A:')) (where k' is double the next integer power of two and the 
log is base two) to the 0{ki x ^ 2 )); this can reduce numerical error further. 

3 Results 

I briefly compare the speed and accuracy of the fast numerical max-convolution 
method as compared to naive max-convolution. Both methods are implemented 
in the Python programming language using floating point math and the numpy 
package (the fast numerical max-convolution method is implemented from al¬ 
gorithm 
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Algorithm 3 Numerical max-convolution (normalized version), a nu¬ 
merical method to estimate the max-convolution of two PMFs or nonnegative 
vectors (revised to reduce underflow). The parameters are two PMFs pmf^^ 
and pmf^ (by definition nonnegative) and the numerical value p* used for com¬ 
putation. The return value is a numerical estimate of the max-convolution 
pmfi, *max Pmffl-_ 

1: procedure MAxCONVOLUTlONREVlSED(pmf^, pmfjj, p*) 

2: ^max argmax^ 

3: rmax argmax,, pmf^fr] 

* «■ “‘M ^ 

6: vm pmf^ * pmf^ 

7: rrimax argmax„ VM[m] 

8: Vm, pmf;,H ^ 

9: return pmf^[tmax] pmf^[rmax] pmf^^ [> Undo the previously performed 

scaling 

10: end procedure 


0 Standard FFT convolution is used here 
> Correct any small errors in FFT convolution 


3.1 Practical efficiency of fast numerical max-convolution 

The speed of naive max-convolution is compared to the fast numerical es¬ 
timate. For each k G {32,64,128, 256, 512,1024, 2048,4096, 8192}, random pairs 
of vectors with uniform elements (ie., each element is drawn from uniform(0, 1)) 
The result of the max-convolution pmf)^ *max pmffl is computed via 0{k^) naive 
max-convolution and the fast numerical method. Figure demonstrates an 
substantial speedup in practice. 

3.2 Accuracy of fast numerical max-convolution compared 
to naive max-convolution 

A cursory empirical test of the numerical stability as a result of p* and the 
vector length k was performed. In figure the numerical stability was demon¬ 
strated on 64 random pairs of vectors for each length k G 128,256,512,1024. 
For all p* G {2,4, 8,16, 32, 64}, and the relative absolute error of each element 
in the result ot the max-convolution is computed - ^ /r . -where 

^ I exact[m\ 

numerical[m\ and exact[m\ refer to the value at index m of the numerical and 
naive results respectively. (Figure demonstrates the relationship between 
p*, fc, the relative absolute error and the magnitude of the exact result. 

Qualitatively, optimizing the numerical performance involves satisfying com¬ 
peting ideals: when p* is large, the problem solved converges to the max- 
convolution inference, but underflow becomes significant. When p* it’s too 
small, non-maximal terms contribute to the result (more similar to a standard 
convolution). 
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Figure 1: Runtime comparison between naive and fast numerical con¬ 
volution. Note the increasing gap between the two curves, which indicates a 
nonlinear speedup because of the log-scaling of both axes (because the runtime 
of the fast numerical method is dominated by FFT calculation). 
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k = 128 


k = 256 





Figure 2: The influence of the parameter p* on max-convolution ac¬ 
curacy. For each k G {256, 512,1024}, 64 replicate max-convolutions are per¬ 
formed to compare the relative error of the fast numerical method compared 
to the naive method. This is performed for different values of p*; lower values 
of p* perform well when the exact result is close to zero, and higher values of 
p* perform better otherwise, and this relationship seems invariant of k when 
the data are scaled in the manner presented (they are scaled so that the largest 
element has value 1). 
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Although more sophisticated numerical analysis would almost certainly yield 
larger improvements to the method, a simple improvement is exploited: Gener¬ 
ally the relative error is fairly low, but only becomes high in these experiments 
when the exact value at that index is close to zero (this is intuitive from the 
formula for relative absolute error). Since underflow is the only numerical con¬ 
sideration (because the values are normalized to the maximum), then it follows 
that a result that is not close to zero at some index is convergent when p* is 
substantially large (if it suffered from too much underflow, then it would ap¬ 
proach zero quickly). Therefore, when using a high value of p*, indices where 
the numerical solution is close to zero indicate that the potential for numerical 
error, and suggest that a smaller value of p* could be used for those indices. This 
yields a further improvement where a more accurate result can be constructed 
from two calls of algorithm this improved method is shown in algorithm]^ 
and runs in roughly twice as many steps as algorithm (still G 0(A: log(fc))). 
Note that this piecewise method could be trivially extended to use more than 
two values of p* , increasing accuracy at the expense of additional runtime (al¬ 
though, assuming the results using the different values of p* are computed in 
decreasing order, then the routine could potentially terminate once the result 
has been estimated at all indices with adequate numeric stability). 

3.3 Using fast numerical convolution to solve a probabilis¬ 
tic subset sum problem 

Lastly, a three-way piecewise implementation of max-convolution similar to 
the one shown in algorithm but using p* G {4,32,64} (the Python code of 
this three-way piecewise method is given in the accompanying Python demon¬ 
stration code) is used to solve a simulated probabilistic generalization of the 
the subset sum problem. In this problem, n = 32 people go shopping and each 
person j m{l, 2,... nj buys exactly one of two items (the price of the item they 
do purchase is and the price of the item they do not purchase is 

where the costs of both items for each person j are unknown to us. Then, given 
fuzzy knowledge about the costs of these items with all prices discretized into 
k = 256 bins (ie., Vj, pm^^. [^] where £ G (0,1,... fc — 1}) and given fuzzy 
knowledge about the total amount spent (pmf^, where M = we try 

to infer the amount spent by each person (ie., for each person j, estimating 

(true)-, 
t^j )■ 

Data are generated as follows: At each variable Xj,j G {l,2,...n}, two 
means are randomly sampled ~ uniform{0,k — 1), and a dis¬ 

cretized Gaussian PMF is centered about each mean (the standard deviations 
of the Gaussians are each sampled ~ uniform{0, ^)). A vector 

proportional to the PMF pmfj^^. \(\ is computed using the sum of these Gaus¬ 
sians with a vector of uniform noise ^ unif orm(0,0.0001). The likeli¬ 

hood distribution on the sum M = Xj is generated by adding a Gaussian 

with mean J2j variance 0.005 x {nk — {n — 1)) (i.e., 0.005x the 
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Algorithm 4 Numerical max-convolution (piecewise normalized ver¬ 
sion), a numerical method to estimate the max-convolution of two PMFs or 
nonnegative vectors (further revised to strategically choose p*). The param¬ 
eters are two PMFs pmf^ and pmf^ both with with k outcomes (and both 
by definition nonnegative). The return value is a numerical estimate of the 
max-convolution pmf^^ *max pmfjj. This algorithm calls the revised method 
maxConvolutionRevised from algorithm Note that the values of p’lower^ 
Phigher^ and T Specified here may be chosen to optimize performance on a specific 
application. 

1: procedure MAxCONVOLUTlONPlECEWlSE(pmf 2 ^, pmfjj) 

2: Plower -^4.0 

3- Phigher ^ 32.0 

4: T ■«— 0.6 t> This is a two-way piecewise implementation; higher orders will be 

slower and more accurate 

5: fstabie[m] ^ maxConvolutionRevised{pmij^,pmfj^,Pi^^^j.) > Numerically 

stable upper bound 

6: faggressive[m] ■«— maxConvolutionRevised{pmf pmi > More 

aggressive calculation 
7: for m to k, m+ = 1 do 

8: if f higher [m] > T then 

9: f[m] ^ faggressive[m] t> When Mgh-p* result is large enough, 

underflow isn’t an issue 
10: else 

11: /[m] fstabielm] > Where the high-p* result is small, use fstable to 

avoid underflow 

12: end if 

13: end for 

14: return / 

15: end procedure 
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possible number of outcomes for M), plus point-wise samples of uniform noise 
Q,(out)j£] ^ unif orm{0, 0.0001). 

On each problem instance, likelihoods for all inputs (pmfy^, pmf^^,... pmfy^) 
are computed twice, once using naive max-convolution and once using the nu¬ 
merical method. Note that even though these values of n and k do not appear 
particularly large, the full max-convolution tree that they produce will will com¬ 
pute several max-convolutions on the order of 0{k) and a few on the order of 
0{n X k), which in this case can be 32 x 256 = 8192. For this reason, computing 
the likelihood curve with the naive result requires 159 seconds, while the fast 
numerical approach takes 0.935 seconds to compute a highly similar result. 

A single likelihood distribution pmfy^ for one particular j G {l,2,...n}, 
which was computed using the naive method, the fast numerical method is 
plotted in figure This figure also demonstrates the utility of max-product 
inference by also showing the result from sum-product inference, which is much 
less informative (and does not have a mode close to the correct answer. 

4 Discussion 

Although its ethos may ultimately limit the utility of this method to nu¬ 
merical settings where small errors are tolerable (as opposed to the to more 
general theoretical papers previously mentioned [^|^), numerical method pro¬ 
posed here gives a simple and very fast estimate of the max-convolution result, 
which could allow use of numerical max-convolution (or max-product inference 
on the sums and differences between discrete distributions) in settings where it is 
currently far too computationally expensive. The largest caveat to the method 
is the inaccuracy that can occur due to numerical bottlenecks {e.g., underflow); 
however, for many problems {e.g., practical applications of the max-product 
inference problem in figure [^, the numerical method is sufficient to perform 
high-quality inference, but in a dramatically faster time. 

Furthermore, the connection between the max-convolution problem and the 
all-pairs shortest path problem from graph theory means that the fast nu¬ 
merical method may be used to compute fast numerical approximations to that 
important computer science problem. Such fast numerical estimates could com¬ 
plement theoretical solutions to that problem. 

A more in-depth theoretical analysis of the algorithm’s error would likely 
yield multiple opportunities to modify the algorithm in order to decrease er¬ 
ror. For example, one possible improvement could be performed by using 
log-transformed real values: in log-transformed space, raising to the power p* 
would be equivalent to scaling by p*, and would not produce significant un¬ 
derflow. Furthermore, the operations required by FFT convolution could be 
performed in log-space by translating the ring (-l-,x) on real values to its equiv¬ 
alent ring (log_|_,log,^) = (log_|_,-|-) on log-transformed values, where the oper¬ 
ation log_|_(a:, y) is performed by dividing out the greater of the two arguments 
X (w.l.o.g.) and then computing log(l + y) = log(l -I- z) via Taylor series; per¬ 
forming the Cooley-Tukey FFT on log-transformed values (or possibly using 
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Figure 3: Using fast numerical max-convolution for max-product infer¬ 
ence. A single likelihood distribution for one particular j G {1, 2,... n} from 
the probabilistic generalization of the subset-sum problem is shown. This dis¬ 
tribution is computed via a probabilistic max-convolution tree (the problem 
is solved twice, once with naive max-convolution and once via fast numerical 
max-convolution). The true mode value for that j is indicated by the 

bar beneath the largest mode. To compare the results of different types of in¬ 
ference, the sum-product result from the convolution tree (using the standard 
convolution operator) is also plotted; note that sum-product inference produces 
a less discriminative likelihood curve because many possible joint events have 
diffused into it. 
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a different FFT algorithm that is particularly well suited for log-transformed 
values) could represent one route for improving the accuracy. 

In a similar vein, more sophisticated techniques for locally choosing among 
a small number of values for p* (compared to a simple piecewise function on 
two possible values oi p*). For instance, under roughly uniform distributions of 
values in the two vectors being max-convolved, the values closest to zero (and 
thus having higher chance of having high relative absolute error) will occur more 
commonly at the first and last indices of the numerical estimate (because those 
indices take the maximum over smaller collections of elements, and are thus more 
likely to be smaller values). Similar attention could be put toward scaling the 
vectors prior to taking elements to the p* (compared to the current procedure 
of dividing by the maximum element value) may minimize the underflow on a 
large number of points with large values. A most exciting possibility would be 
that having an accurate estimate of the max-convolution result somehow could 
be used to compute a more accurate result {e.g., using approximate results 
from different p* and exploiting the property || • ||p > || • ||p +5 when p > 1 
and i5 > 0). Such directions of future research could possibly solving the max- 
convolution iteratively over a bounded or constant number of subproblems where 
each subproblem requires 0{k log(fc)), by first computing initial estimates of the 
max-convolution result with the numerical method presented here, and then 
using those initial estimates to parameterize a subsequent call to the numerical 
method in a manner reminiscent of the QR algorithm for eigendecomposition 
. From algorithm it seems highly likely that there will be more ways 
by which an initial result can be used to obtain a higher-accuracy result. 

Furthermore, even pursuing methods for obtaining very high accuracy with 
large values of p* may not always be of substantial interest. Indeed, even if 
no improvement to accuracy is ever presented, the design and parameteriza¬ 
tion of machine learning methods (including graphical models) has traditionally 
been empirically driven, and the use of exact max-product inference is hardly 
sacrosanct in every application (as opposed to inference somewhere between 
sum-product and max-product). From this perspective, rather than choosing 
p* as a static constant value p* = 1 (i.e., performing sum-product inference) 
or p* —>■ oo {i.e., performing max-product inference), p* could be viewed as a 
hyperparameter that is used to position inference on a continuum somewhere 
between all joint events contributing equally to the end result (sum-product) 
and only the best joint event contributing (max-product), and intermediate 
values of p* would establish a preference for the top few joint events. In this 
sense, the value chosen for p* could be driven by the data, and the problem 
of p-norm convolution (where a finite p is desired, rather than max-convolution 
where p —)■ oo) can already be solved with very high accuracy by the proposed 
numerical method for any moderate choice of p*. 
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5 Availability 

A simple illustration of the fast numeric max-convolution method in Python 
(using the numpy package), including a three-way piecewise implementation, is 
available at https://bitbucket.org/orserang/fast-numerical-max-convolution. 
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